Red Ventures Data Analysis Assessment - Yuvraj Dhadwal 10/13/2024¶
Our first goal is to discover whether there are relationships between any of the features in this dataset and approval for a loan. An easy way to see this is to look at the correlation between each feature and loan approval.
# First step is to import all the libraries
import pandas as pd
import xgboost as xgb
import plotly.express as px
import numpy as np
from plotly.subplots import make_subplots
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.metrics import balanced_accuracy_score
# Ensure that plotly widgets are saved to HTML file once notebook is exported
import plotly.io as pio
pio.renderers.default = 'notebook'
# Second step is to import in the dataset
data = pd.read_excel('Pre-Super_Day_candidate_dataset__28candidate_29.xlsx')
Okay, so we will go through every variable and check to see if there is a relationship between that variable and approval.
- The first feature is User ID. This variable is very clearly not correlated with approval.
- The second feature is Applications. From a cursory look, it seems that this feature is always valued at one. However, we should double check.
# bincount is a numpy function that counts how many times each non-negative integer appears
# so indexing the result at [1] gives the number of rows whose application count equals one
print(np.bincount(data['applications'])[1])
100000
As expected, every row in this dataset has the same number of applications (which is 1), so this feature has no influence on loan approval.
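A quicker way to confirm a constant column, should we want one, is `nunique`; a sketch on a stand-in series (the toy data is illustrative, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the real column; on the dataset this would be
# data['applications'].nunique() == 1
applications = pd.Series([1] * 5)

# nunique() reports the number of distinct values, so a constant column returns 1
assert applications.nunique() == 1
print(applications.nunique())  # -> 1
```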
The next feature is reason. To check whether there is any relationship between Reason and Application Approval, we will create a pie chart that represents the frequency of all reasons and compare this pie chart with one that represents the frequency of all reasons given that the application was approved.
# Creates a reasons numpy array where it saves the reasons for each approved loan
reasons = np.array([data['Reason'][i] for i in range(len(data)) if data['Approved'][i] == 1])
# Converts into pandas series to perform value_counts function
reasons_series = pd.Series(reasons)
approved_reason_counts = reasons_series.value_counts()
reason_counts = data['Reason'].value_counts()
# Using plotly to create plot and add subplots within
fig = make_subplots(rows=1, cols=2,
subplot_titles=("Approved Loan Reasons Distribution",
"Applicant Loans Distribution"),
specs=[[{'type': 'domain'}, {'type': 'domain'}]])
fig.add_trace(px.pie(values=approved_reason_counts.values,
names=approved_reason_counts.index).data[0],
row=1, col=1)
fig.add_trace(px.pie(values=reason_counts.values,
names=reason_counts.index).data[0],
row=1, col=2)
fig.update_layout(title_text="Loan Reasons Distribution: Approved Loans vs Applicants", showlegend=True)
fig.show()
Based on the two pie charts above, there is little to no correlation between reason for loan request and the approval of that loan request. In fact, when I first made this chart I thought I made a mistake because they were so similar.
So far, all the variables have had very little correlation with loan approval. However, I think that may change with the next variable: Loan Amount. To see if there is any correlation between these two variables, I will use the Pearson correlation coefficient function provided by Pandas.
# Uses Pandas Pearson Correlation Coefficient function that is used to compare two Pandas Series
print(data['Loan_Amount'].corr(data['Approved']))
-0.04008939145471979
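For reference, the value pandas' `.corr` returns here is the Pearson coefficient, cov(x, y) / (σx · σy); a sketch verifying that by hand on toy vectors (illustrative numbers, not the dataset):

```python
import numpy as np
import pandas as pd

# Toy vectors to check that .corr matches the Pearson formula
x = pd.Series([100.0, 250.0, 175.0, 300.0, 225.0])
y = pd.Series([0, 1, 0, 1, 1])

# cov(x, y) / (std(x) * std(y)), written out from the deviations
manual = ((x - x.mean()) * (y - y.mean())).sum() / (
    np.sqrt(((x - x.mean()) ** 2).sum()) * np.sqrt(((y - y.mean()) ** 2).sum())
)
assert np.isclose(manual, x.corr(y))
print(round(manual, 4))
```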
A correlation this close to 0 suggests that loan amount also has no bearing on the approval of the application. However, to be certain, we can double check by graphing a scatter plot of all approved loan amounts.
# Creates numpy array that saves all the loan amounts of each accepted loan request
approved_loans = np.array([data['Loan_Amount'][i] for i in range(len(data)) if data['Approved'][i] == 1])
# Creates a scatter plot using plotly
fig1 = px.scatter(x=range(len(approved_loans)), y=approved_loans,
labels={'x': 'Loan Index', 'y': 'Loan Amount'},
title="Scatter Plot of Approved Loan Amounts")
fig1.show()
If I had any doubts before that loan amount and approval might be correlated, this scatter plot confirms that there is no relationship.
The next variable we need to check is FICO score. I don't know much about loan approval, but I do know that FICO score is commonly held to have a direct relationship with it. Let us double check that this is also true in this dataset.
# Same correlation function from Pandas
print(data['FICO_score'].corr(data['Approved']))
0.26958715790146953
Okay, so there definitely is some positive correlation between your FICO score and whether your application is approved. Let us graph this to double check.
# Saves numpy array of all FICO scores of approved loan requests
approved_FICOs = np.array([data['FICO_score'][i] for i in range(len(data)) if data['Approved'][i] == 1])
# Creates two scatterplots to compare FICO scores of all loan requests and the ones that were approved
fig6 = px.scatter(x=range(len(approved_FICOs)), y=approved_FICOs,
labels={'x': 'Loan Index', 'y': 'FICO Score'},
title="Scatter Plot of FICO Scores for Approved Applications")
fig7 = px.scatter(x=range(len(data)), y=data['FICO_score'],
labels={'x': 'Loan Index', 'y': 'FICO Score'},
title="Scatter Plot of FICO Scores for All Applications")
fig7.show()
fig6.show()
Wow! We found our first direct relationship between a feature and approval! It is clear from the correlation between the two features and the two graphs that there is definitely a positive relationship where applicants with higher FICO scores are much more likely to have their loan applications approved.
The next variable is FICO score grouping. However, your FICO grouping is based directly on your FICO score, so these two variables essentially tell us the same information. I have put the mapping in the table below.
| FICO Score | FICO Score Grouping |
|---|---|
| 300 - 579 | Poor |
| 580 - 669 | Fair |
| 670 - 739 | Good |
| 740 - 799 | Very Good |
| 800 - 850 | Excellent |
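Since the grouping is derivable, a small helper can reproduce it from a raw score; this sketch uses the lowercase labels ('poor', 'fair', 'good', 'very_good', 'excellent') that appear in the dataset:

```python
def fico_group(score: int) -> str:
    """Map a FICO score to its grouping per the table above."""
    if score < 300 or score > 850:
        raise ValueError("FICO scores range from 300 to 850")
    if score <= 579:
        return 'poor'
    if score <= 669:
        return 'fair'
    if score <= 739:
        return 'good'
    if score <= 799:
        return 'very_good'
    return 'excellent'

print(fico_group(655))  # -> fair
```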
While this feature is redundant, it makes it easier to visualize the relationship between FICO Score and Loan Application Approval than simply plotting FICO scores like we did above. Thus we will create two pie charts comparing the Applicant FICO Grouping vs the Approved FICO Grouping.
# Numpy array that saves the FICO score groups of approved loan requests
approved_fico_group = np.array([data['Fico_Score_group'][i] for i in range(len(data)) if data['Approved'][i] == 1])
# Converts to Pandas Series to leverage value_counts function
approved_fico_group_series = pd.Series(approved_fico_group)
approved_fico_group_counts = approved_fico_group_series.value_counts()
fico_group_counts = data['Fico_Score_group'].value_counts()
# Creates a color map to ensure that plotly uses same colors for both plots
color_map = {
'poor': '#Ffc0cb',
'fair': '#636EFA',
'good': '#EF553B',
'very_good': '#00CC96',
'excellent': '#AB63FA'
}
# Creates an order for the categories because plotly auto orders which can confuse readers
fico_order = ['poor', 'fair', 'good', 'very_good', 'excellent']
# Re-indexes the Pandas Series and fills any missing groups (there should be none) with a count of 0
approved_fico_group_counts = approved_fico_group_counts.reindex(fico_order).fillna(0)
fico_group_counts = fico_group_counts.reindex(fico_order).fillna(0)
# Converts both Pandas Series into Pandas Dataframes
approved_fico_df = pd.DataFrame({
'Fico_Score_group': approved_fico_group_counts.index,
'count': approved_fico_group_counts.values
})
fico_df = pd.DataFrame({
'Fico_Score_group': fico_group_counts.index,
'count': fico_group_counts.values
})
# Creates two subplots to compare the FICO score groups of applicants and approvals
fig = make_subplots(rows=1, cols=2,
subplot_titles=("Approved Loan Fico Group Distribution",
"Applicant Loans Fico Group Distribution"),
specs=[[{'type': 'domain'}, {'type': 'domain'}]])
fig.add_trace(px.pie(approved_fico_df,
values='count',
names='Fico_Score_group',
color='Fico_Score_group',
color_discrete_map=color_map,
category_orders={'Fico_Score_group': fico_order}).data[0],
row=1, col=1)
fig.add_trace(px.pie(fico_df,
values='count',
names='Fico_Score_group',
color='Fico_Score_group',
color_discrete_map=color_map,
category_orders={'Fico_Score_group': fico_order}).data[0],
row=1, col=2)
fig.update_layout(title_text="Fico Group Distribution: Approved Loans vs Applicants", showlegend=True)
fig.show()
It was good that we ran this analysis; now we can very clearly see how FICO scores impact approval.
| FICO Score Grouping | Pie Chart Change | Percentage Change in Representation from Applicants to Approvals |
|---|---|---|
| Poor | 28.5% --> 7.22% | 74.66% decrease in Representation |
| Fair | 36.5% --> 21.4% | 41.37% decrease in Representation |
| Good | 27.8% --> 45.7% | 64.39% increase in Representation |
| Very Good | 5.1% --> 16.6% | 225.49% increase in Representation |
| Excellent | 2.19% --> 9.13% | 316.89% increase in Representation |
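The "Percentage Change in Representation" column is just the relative change between the two pie-chart shares; a quick sketch to reproduce it:

```python
def representation_change(applicant_pct: float, approved_pct: float) -> float:
    """Percent change in a group's share from the applicant pool to approvals."""
    return (approved_pct - applicant_pct) / applicant_pct * 100

# Reproduces the 'Poor' row: 28.5% of applicants -> 7.22% of approvals
print(round(representation_change(28.5, 7.22), 2))  # -> -74.67
```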
This table removes any doubt about the positive correlation between FICO Score and Loan Approval. As mentioned earlier, while FICO Score Grouping is redundant because we have already seen the FICO scores, it is a good way to visualize how important FICO Scores are in Loan Approval.
The next variable to check is employment status of the applicant.
# Creates numpy array that saves employment status of all approved loan requests
employed = np.array([data['Employment_Status'][i] for i in range(len(data)) if data['Approved'][i] == 1])
# Turning into a Pandas Series - there is probably a faster way of doing this, but at this point I am just
# copy-pasting my old code because it is the same code for each variable
employed_series = pd.Series(employed)
approved_employed_counts = employed_series.value_counts()
employed_counts = data['Employment_Status'].value_counts()
# Making subplots with Plotly!
fig = make_subplots(rows=1, cols=2,
subplot_titles=("Approved Loan Employment Distribution",
"Applicant Employment Distribution"),
specs=[[{'type': 'domain'}, {'type': 'domain'}]])
fig.add_trace(px.pie(values=approved_employed_counts.values,
names=approved_employed_counts.index).data[0],
row=1, col=1)
fig.add_trace(px.pie(values=employed_counts.values,
names=employed_counts.index).data[0],
row=1, col=2)
fig.update_layout(title_text="Employment Distribution: Approved Loans vs Applicants", showlegend=True)
fig.show()
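As an aside, the list-comprehension-plus-Series pattern copied through these cells could be collapsed into a single boolean-mask line; a sketch on a small made-up frame (the toy data is illustrative only):

```python
import pandas as pd

# Toy frame standing in for the real dataset
df = pd.DataFrame({
    'Employment_Status': ['full_time', 'part_time', 'full_time', 'unemployed'],
    'Approved': [1, 0, 1, 0],
})

# One-line equivalent of building the numpy array, wrapping it in a Series,
# and calling value_counts on it
approved_counts = df.loc[df['Approved'] == 1, 'Employment_Status'].value_counts()
print(approved_counts.to_dict())  # -> {'full_time': 2}
```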
There seems to be a strong relationship between employment status and loan approval, with favorability towards those who are employed full-time. Thus, this feature is important. This also makes sense because if you have a full-time job, you are more likely to be able to make your monthly loan payment.
The next feature to analyze will be employment sector of our applicants.
# Employment Sector is empty for unemployed individuals, so I must fill it in
data['Employment_Sector'] = data['Employment_Sector'].replace('', None).fillna('unemployed')
# Same logic as with every other cell above
employment = np.array([data['Employment_Sector'][i] for i in range(len(data)) if data['Approved'][i] == 1])
employment_series = pd.Series(employment)
approved_employment_counts = employment_series.value_counts()
employment_counts = data['Employment_Sector'].value_counts()
# Produces plotly plots to compare visually
fig = make_subplots(rows=1, cols=2,
subplot_titles=("Approved Loan Employment Distribution",
"Applicant Employment Distribution"),
specs=[[{'type': 'domain'}, {'type': 'domain'}]])
fig.add_trace(px.pie(values=approved_employment_counts.values,
names=approved_employment_counts.index).data[0],
row=1, col=1)
fig.add_trace(px.pie(values=employment_counts.values,
names=employment_counts.index).data[0],
row=1, col=2)
fig.update_layout(title_text="Employment Distribution: Approved Loans vs Applicants", showlegend=True)
fig.show()
While at first glance it may seem as if there is no difference in these pie charts, there are differences in the percentages. Thus, employment sector has some effect on loan approval. This might be because of other factors that are not necessarily the occupation itself. For example, a software engineering job pays more than a teaching job, so software engineers might have higher FICO scores. Again, all we have found here is a weak correlation, not causation.
The next variable to check would be monthly gross income.
# Same Pandas correlation function as described above
print(data['Monthly_Gross_Income'].corr(data['Approved']))
0.17179137566825903
Clearly there is a positive correlation between Gross Monthly Income and Loan Application Approval, which solidifies my earlier point that an individual's income seems to be much more important than their industry sector. It's just that some industry sectors do not pay as much as others.
Let us now analyze the Monthly Housing Payment.
print(data['Monthly_Housing_Payment'].corr(data['Approved']))
-0.0275580000194345
There seems to be no relationship between Monthly Housing Payment and Approval.
However, I do want to see if there is a relationship between some combination of Income and Housing and Approval. For example, is there a relationship between Income - Housing and Approval, or between Income / Housing and Approval?
# Element wise subtraction and division to see if there is any correlation
print((data['Monthly_Gross_Income'] - data['Monthly_Housing_Payment']).corr(data['Approved']))
print((data['Monthly_Gross_Income'] / data['Monthly_Housing_Payment']).corr(data['Approved']))
0.1805812211933719
0.17788810736821553
There is a higher correlation between Monthly Gross Income - Monthly Housing Payment and Approval compared to the other correlation values we have seen. This makes sense: lenders probably care much more about your take-home pay than your gross income. For example, if you live in Silicon Valley, a dollar is only worth about 80 cents, so your high income might not be as powerful as it would be if you were living in Idaho.
The final variable to check is whether an individual has ever gone bankrupt or foreclosed.
# Same logic as all previous python cells except this time for bankruptcies and foreclosures
bankrupt = np.array([data['Ever_Bankrupt_or_Foreclose'][i] for i in range(len(data)) if data['Approved'][i] == 1])
bankrupt_series = pd.Series(bankrupt)
approved_bankrupt_counts = bankrupt_series.value_counts()
bankrupt_counts = data['Ever_Bankrupt_or_Foreclose'].value_counts()
fig = make_subplots(rows=1, cols=2,
subplot_titles=("Approved Loan Bankruptcy/Foreclosure Distribution",
"Applicant Bankruptcy/Foreclosure Distribution"),
specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(px.pie(values=approved_bankrupt_counts.values,
names=approved_bankrupt_counts.index).data[0],
row=1, col=1)
fig.add_trace(px.pie(values=bankrupt_counts.values,
names=bankrupt_counts.index).data[0],
row=1, col=2)
fig.update_layout(title_text="Bankruptcy/Foreclosure Distribution (Approved vs Applicants)", showlegend=False)
fig.show()
There is a negative correlation between bankruptcy/foreclosure and loan application approval. This makes sense, because lenders want to see a return on the loan they are investing in, and if there are any hurdles in that process, a bank may not want to lend to you.
Through all the analysis above, I have deduced that the most important features to predict whether a loan will be approved are FICO Score, Employment Status, and whether the applicant has ever filed for Bankruptcy or Foreclosure. I also noticed a high correlation between (Monthly Gross Income - Monthly Housing Payment) and Loan Approval. This is one feature transformation that would improve our predictive process. Altogether, these 4 variables are a very powerful way to predict whether an application will be approved.
That being said, some features should not be collected because they have little to no correlation with loan approval. These features are User ID, Application, Reason, and Loan Amount. FICO Score Grouping is redundant with FICO Score and thus should also be considered for removal despite its high correlation with loan approval. While there is a relationship between employment sector and loan request approval, I think it reflects correlation rather than causation and thus should not be weighed very heavily.
Our next goal is to learn more about each of these lenders and see what kind of applicants they prefer to offer loans to. To do this, we will first create three separate datasets, one for each lender.
# Creating new Pandas DataFrames for each application lender
A = data[data['Lender'] == 'A']
B = data[data['Lender'] == 'B']
C = data[data['Lender'] == 'C']
Let's look more closely at the variables that I think are important in determining whether you receive a loan and see how these lenders weigh each of them.
# Checks Employment Status value_counts for each new dataframe
status_counts_A = A['Employment_Status'].value_counts()
status_counts_B = B['Employment_Status'].value_counts()
status_counts_C = C['Employment_Status'].value_counts()
# Creates subplots to compare them visually
fig = make_subplots(rows=1, cols=3,
subplot_titles=("Lender A", "Lender B", "Lender C"),
specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]])
fig.add_trace(px.pie(values=status_counts_A.values, names=status_counts_A.index).data[0], row=1, col=1)
fig.add_trace(px.pie(values=status_counts_B.values, names=status_counts_B.index).data[0], row=1, col=2)
fig.add_trace(px.pie(values=status_counts_C.values, names=status_counts_C.index).data[0], row=1, col=3)
fig.update_layout(title_text="Employment Status Applicants by Lender (A, B, C)", showlegend=True)
fig.show()
This graph is very telling! It seems that Lender A has very few unemployed people applying, Lender B has fewer part-time employed people applying, and Lender C has plenty of both unemployed and part-time employed individuals applying for loans.
However, this is just applicants, let us see who is getting accepted by each of these lenders.
# Creates new dataframes for each lender where the application is approved
A_approved = A[A['Approved'] == 1]
B_approved = B[B['Approved'] == 1]
C_approved = C[C['Approved'] == 1]
# Creates value_counts for each Lender approved application
status_approved_counts_A = A_approved['Employment_Status'].value_counts()
status_approved_counts_B = B_approved['Employment_Status'].value_counts()
status_approved_counts_C = C_approved['Employment_Status'].value_counts()
# Creating subplots to visually compare
fig = make_subplots(rows=1, cols=3,
subplot_titles=("Lender A", "Lender B", "Lender C"),
specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]])
fig.add_trace(px.pie(values=status_approved_counts_A.values, names=status_approved_counts_A.index).data[0], row=1, col=1)
fig.add_trace(px.pie(values=status_approved_counts_B.values, names=status_approved_counts_B.index).data[0], row=1, col=2)
fig.add_trace(px.pie(values=status_approved_counts_C.values, names=status_approved_counts_C.index).data[0], row=1, col=3)
fig.update_layout(title_text="Employment Types by Lender (A, B, C)", showlegend=True)
fig.show()
Now this is fascinating. It seems that people who are applying to these loans know who is most likely to get approved. For example, unemployed people recognize their best shot to get a loan is from Lender C, whereas part-time employed people see that they have the least likelihood of approval from Lender B and thus do not apply there as much.
If this trend is true, we should also see it in the employment sectors of the loan requests approved by Lender C.
sector_approved_counts_A = A_approved['Employment_Sector'].value_counts()
sector_approved_counts_B = B_approved['Employment_Sector'].value_counts()
sector_approved_counts_C = C_approved['Employment_Sector'].value_counts()
fig = make_subplots(rows=1, cols=3,
subplot_titles=("Lender A", "Lender B", "Lender C"),
specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]])
fig.add_trace(px.pie(values=sector_approved_counts_A.values, names=sector_approved_counts_A.index).data[0], row=1, col=1)
fig.add_trace(px.pie(values=sector_approved_counts_B.values, names=sector_approved_counts_B.index).data[0], row=1, col=2)
fig.add_trace(px.pie(values=sector_approved_counts_C.values, names=sector_approved_counts_C.index).data[0], row=1, col=3)
fig.update_layout(title_text="Employment Sectors by Lender (A, B, C)", showlegend=True)
fig.show()
Again, just as expected, Lender C is accepting a lot more unemployed people into their loan program than either of their two competitors. Other than that, it doesn't seem like occupation has much effect on whether you will be selected.
Let's see if bankruptcies/foreclosures have a different impact for each of the lenders.
# Taking the value_counts function from Pandas for approved and applicants to each loan program
bank_A = A['Ever_Bankrupt_or_Foreclose'].value_counts()
bank_B = B['Ever_Bankrupt_or_Foreclose'].value_counts()
bank_C = C['Ever_Bankrupt_or_Foreclose'].value_counts()
bank_approved_counts_A = A_approved['Ever_Bankrupt_or_Foreclose'].value_counts()
bank_approved_counts_B = B_approved['Ever_Bankrupt_or_Foreclose'].value_counts()
bank_approved_counts_C = C_approved['Ever_Bankrupt_or_Foreclose'].value_counts()
# Creating subplots through Plotly to compare visually
fig = make_subplots(rows=2, cols=3,
subplot_titles=("Lender A Applicants", "Lender B Applicants", "Lender C Applicants",
"Lender A Approvals", "Lender B Approvals", "Lender C Approvals"),
specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}], [{'type':'domain'},
{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(px.pie(values=bank_A.values, names=bank_A.index).data[0], row=1, col=1)
fig.add_trace(px.pie(values=bank_B.values, names=bank_B.index).data[0], row=1, col=2)
fig.add_trace(px.pie(values=bank_C.values, names=bank_C.index).data[0], row=1, col=3)
fig.add_trace(px.pie(values=bank_approved_counts_A.values, names=bank_approved_counts_A.index).data[0], row=2, col=1)
fig.add_trace(px.pie(values=bank_approved_counts_B.values, names=bank_approved_counts_B.index).data[0], row=2, col=2)
fig.add_trace(px.pie(values=bank_approved_counts_C.values, names=bank_approved_counts_C.index).data[0], row=2, col=3)
fig.update_layout(title_text="Bankruptcies and Foreclosures by Lender (A, B, C) and Approval", showlegend=True)
fig.show()
It seems like more people who have had bankruptcies or foreclosures are applying to Lender C, and more of them are getting accepted by Lender C. In other words, there is a negative correlation between bankruptcies/foreclosures and loan request approval for all three lenders, but Lender A and Lender B have stronger negative correlations than the more lenient Lender C. Lender B, in fact, didn't accept a single person with a bankruptcy or foreclosure on their record.
We should also check FICO score for all these lenders.
# Creating value_counts for each lender, for both applications and approvals, re-indexing
# into a fixed category order and filling any missing groups with the value 0
A_fico = A['Fico_Score_group'].value_counts().reindex(fico_order).fillna(0)
B_fico = B['Fico_Score_group'].value_counts().reindex(fico_order).fillna(0)
C_fico = C['Fico_Score_group'].value_counts().reindex(fico_order).fillna(0)
A_approved_fico = A_approved['Fico_Score_group'].value_counts().reindex(fico_order).fillna(0)
B_approved_fico = B_approved['Fico_Score_group'].value_counts().reindex(fico_order).fillna(0)
C_approved_fico = C_approved['Fico_Score_group'].value_counts().reindex(fico_order).fillna(0)
# Creating subplots to compare visually
fig = make_subplots(rows=2, cols=3,
subplot_titles=("Applicant Lender A",
"Applicant Lender B",
"Applicant Lender C",
"Approved Lender A",
"Approved Lender B",
"Approved Lender C"))
fig.add_trace(px.bar(x=A_fico.index, y=A_fico.values,
color_discrete_map=color_map,
category_orders={'Fico_Score_group': fico_order}).data[0],
row=1, col=1)
fig.add_trace(px.bar(x=B_fico.index, y=B_fico.values,
color_discrete_map=color_map,
category_orders={'Fico_Score_group': fico_order}).data[0],
row=1, col=2)
fig.add_trace(px.bar(x=C_fico.index, y=C_fico.values,
color_discrete_map=color_map,
category_orders={'Fico_Score_group': fico_order}).data[0],
row=1, col=3)
fig.add_trace(px.bar(x=A_approved_fico.index, y=A_approved_fico.values,
color_discrete_map=color_map,
category_orders={'Fico_Score_group': fico_order}).data[0],
row=2, col=1)
fig.add_trace(px.bar(x=B_approved_fico.index, y=B_approved_fico.values,
color_discrete_map=color_map,
category_orders={'Fico_Score_group': fico_order}).data[0],
row=2, col=2)
fig.add_trace(px.bar(x=C_approved_fico.index, y=C_approved_fico.values,
color_discrete_map=color_map,
category_orders={'Fico_Score_group': fico_order}).data[0],
row=2, col=3)
fig.update_layout(title_text="Fico Group Distribution: Approved Loans vs Applicants", showlegend=False)
fig.show()
From this analysis, it seems that applicants with credit scores on the lower end of the spectrum are disproportionately denied by Lender B. We can check this quickly by seeing whether there is a higher positive correlation between FICO Score and Approval for Lender B.
print(B['FICO_score'].corr(B['Approved']))
0.31694935386965367
As suspected, there is a much more direct positive correlation between FICO Score and Approval for Lender B compared to the correlation across all three lenders, which we analyzed previously.
Now let's see an applicant's chances of being approved based solely on their FICO Score.
# Element-wise division: takes the count of each FICO score grouping among each lender's approvals
# and divides it by the count of that grouping among that lender's applicants
A_chances_fico = A_approved_fico / A_fico
B_chances_fico = B_approved_fico / B_fico
C_chances_fico = C_approved_fico / C_fico
# Visual comparison using Plotly Subplots
fig = make_subplots(rows=1, cols=3,
subplot_titles=("Lender A",
"Lender B",
"Lender C"))
fig.add_trace(px.bar(x=A_chances_fico.index, y=A_chances_fico.values,
color_discrete_map=color_map,
category_orders={'Fico_Score_group': fico_order}).data[0],
row=1, col=1)
fig.add_trace(px.bar(x=B_chances_fico.index, y=B_chances_fico.values,
color_discrete_map=color_map,
category_orders={'Fico_Score_group': fico_order}).data[0],
row=1, col=2)
fig.add_trace(px.bar(x=C_chances_fico.index, y=C_chances_fico.values,
color_discrete_map=color_map,
category_orders={'Fico_Score_group': fico_order}).data[0],
row=1, col=3)
fig.update_layout(title_text="Fico Group Distribution: Approved Loans Divided by Applicants", showlegend=False)
fig.show()
If FICO Score were the only variable considered for loan approval, the graphs would peak much closer to 100% rather than 55%. If FICO Scores were not correlated with loan approval, we would expect approximately uniform bar charts. This is further proof that FICO scores are vital in determining approval for a loan. However, it is clear that these three lenders put different emphasis on this one variable.
As expected, Lender B has a steep drop-off in loan approvals as FICO score decreases. Surprisingly, Lender C has the highest approval rate of the three lenders in almost every FICO Score category. This leads me to think that we should steer applicants with low credit scores, bankruptcies/foreclosures, or unemployment towards Lender C, whereas Lender B will not lend to anyone unless they meet a strict standard of full employment, a high credit score, and no bankruptcies/foreclosures.
Let's take a look at how Income and Approval are correlated for each lender. Maybe some of these lenders cater to lower-income people.
# Comparing all correlations
print("All Approvals: " + str((data['Monthly_Gross_Income'] -
data['Monthly_Housing_Payment']).corr(data['Approved'])))
print("Lender A: " + str((A['Monthly_Gross_Income'] - A['Monthly_Housing_Payment']).corr(A['Approved'])))
print("Lender B: " + str((B['Monthly_Gross_Income'] - B['Monthly_Housing_Payment']).corr(B['Approved'])))
print("Lender C: " + str((C['Monthly_Gross_Income'] - C['Monthly_Housing_Payment']).corr(C['Approved'])))
All Approvals: 0.1805812211933719
Lender A: 0.18503313372267918
Lender B: 0.2059550607329514
Lender C: 0.17619571256288533
Seems like there is a stronger positive correlation between Monthly Take-Home Pay and Approval for Lender B compared to the other two lenders. However, all three have largely similar positive correlations.
Before we complete our analysis, let us simply check the approval rate of each lender.
print('Lender A: ' + f"{len(A_approved) / len(A) * 100:.2f}%")
print('Lender B: ' + f"{len(B_approved) / len(B) * 100:.2f}%")
print('Lender C: ' + f"{len(C_approved) / len(C) * 100:.2f}%")
Lender A: 10.97%
Lender B: 7.13%
Lender C: 17.06%
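As a cross-check, these per-lender rates can be computed in one line by grouping on the Lender column; a sketch on toy data (the toy frame is illustrative, not the real dataset):

```python
import pandas as pd

# Toy frame; on the real data this would be data.groupby('Lender')['Approved'].mean()
toy = pd.DataFrame({
    'Lender':   ['A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Approved': [ 1,   0,   0,   0,   1,   1,   0,   1 ],
})

# The mean of a 0/1 column within each group is exactly the approval rate
rates = toy.groupby('Lender')['Approved'].mean()
print((rates * 100).round(2).to_dict())  # -> {'A': 50.0, 'B': 0.0, 'C': 75.0}
```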
This confirms my suspicion that Lender B is the hardest to get an approval from, because they only take people with the highest pay, the highest FICO Scores, and a clean financial history; Lender C is riskier with who they offer loans to; and Lender A stands in the middle of the other two.
Okay, I think we have completed enough analysis on this data, and I have a good idea of the story this dataset tells us about the three lenders.
Variables Relationship with Approvability¶
The most important variables in this dataset in predicting whether an application will be approved are FICO Score, Employment Status, and previous Bankruptcy/Foreclosure.
With a little bit of data preprocessing, I see there is also a strong direct correlation between Monthly Take-Home Pay (Monthly Gross Income - Monthly Housing Payment) and application approval. This is an example of a feature modification that would improve our ability to predict approval compared to simply looking at Monthly Gross Income.
These four variables logically make sense too. FICO score and previous Bankruptcies/Foreclosures indicate an applicant's history of responsible finances. Monthly Take-Home Pay reassures the lender that the applicant has the finances to pay the monthly installments on their loan. Employment Status is a way for lenders to ensure that the applicant will be able to continue to make their payments on time and maintain a steady, continuous cash flow.
FICO Score Grouping does not need to be collected. While it is helpful for visually showing the strong positive correlation between FICO scores and loan request approval, it can be derived easily using the table I provided earlier in this analysis.
User ID, Application, Reason for Loan, and Loan Amount all have little to no correlation with whether the application is accepted. This makes sense because the lenders don't care about this information; they only care about making a profit. They don't care what you do with your money as long as you can pay them back with interest.
Employment Sector does have some loose connection with loan approval. However, I think this is a case of correlation rather than causation. Some employment sectors pay more, which influences important variables such as Monthly Take-Home Pay and FICO Score, so it only appears that lenders favor these sectors. Thus, this data is also unnecessary to collect.
Lenders Approval Rates¶
The lenders' approval rates based on this dataset are as follows:
| Lender | Average Approval Rate |
|---|---|
| A | 10.97% |
| B | 7.13% |
| C | 17.06% |
There is a clear difference in the types of customers each lender prefers. Lender B is by far the most strict: they do not accept loan applications from people who have gone bankrupt or foreclosed, they have a very strong positive correlation between FICO score and acceptance as well as a positive correlation between Monthly Take-Home Pay and acceptance, and they prefer that the applicant has a full-time job. Lender C, on the other hand, is a lot more lenient with their applicants. They accept far more people with FICO scores on the lower end and people who are not fully employed. They are also less negatively correlated with past bankruptcy or foreclosure (the correlation is still negative, just not as strong). This shows in the average approval rates, where a Lender C application has almost 2.5x the likelihood of approval compared to Lender B. Lender A lands in between the two.
The variables I mentioned above are good indicators where a person should apply for their loan.
Maximizing Profits¶
Based on this dataset, all unemployed and bankrupt/foreclosed applicants should apply to Lender C, because they have very little chance of approval with Lenders A and B. If an applicant works part time and has an excellent FICO score, they should apply to Lender A; otherwise, part-time workers should apply to Lender C. If an applicant has a very good or excellent FICO score and a high-paying full-time job, they should apply to Lender B. All other applicants should go directly to Lender A. This maximizes the chance of approval while matching applicants to the loans that pay us the highest bounty.
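The routing rules above can be sketched as a simple function. The band and employment labels here are illustrative placeholders, not the dataset's exact values, and the "high-paying job" condition is simplified to the FICO band alone:

```python
def recommend_lender(fico_band, employment, bankrupt_or_foreclosed):
    """Route an applicant per the rules above.

    fico_band: 'poor' | 'fair' | 'good' | 'very good' | 'excellent'
    employment: 'unemployed' | 'part time' | 'full time'
    """
    if employment == 'unemployed' or bankrupt_or_foreclosed:
        return 'C'  # very little chance of approval with A or B
    if employment == 'part time':
        # Part-time workers only go to A with an excellent FICO score
        return 'A' if fico_band == 'excellent' else 'C'
    if fico_band in ('very good', 'excellent'):
        return 'B'  # fully employed with a strong score
    return 'A'      # all other applicants

print(recommend_lender('excellent', 'full time', False))  # B
print(recommend_lender('fair', 'part time', False))       # C
```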
Some simple calculations: if we moved all bankrupt/foreclosed applicants to Lender C, we would generate a profit of $3,600. If we moved all unemployed applicants to Lender C, we would generate a profit of $35,500. If we moved all applicants with Excellent FICO scores to Lender B, we would generate a profit of $78,800 from bounties alone. These variables do not by themselves dictate approval, and there may be better ways to make a profit, but these back-of-the-envelope calculations show a lot of lost profit that could be recovered if we were smarter about how we recommend lenders. It would also make it easier for people to get loans, so applicants benefit as well.
If we wanted to improve our application to match customers to lenders in real time, I would recommend a multiclass classification machine learning algorithm, since we are trying to predict which lender would approve an application based on a few key features.
I would recommend XGBoost, a gradient-boosted decision tree library that handles multiclass classification well. XGBoost models are preferred because they build robust, high-accuracy models with minimal training time and data. Given an applicant's data, an XGBoost model can predict the probability of approval from each lender. Multiplying each probability by the bounty we earn on an approval tells us which lender recommendation makes us the most money in expectation. For example, XGBoost might predict that an applicant has an 80% chance of approval from Lender C but also a 60% chance from Lender B. Based on our logic, our application would recommend Lender B, because the expected bounty would be $210 (0.60 × $350) compared to $120 (0.80 × $150) from Lender C. However, if an applicant had a 50% chance of approval from Lender C but only a 10% chance from Lender B, the application would recommend Lender C: $75 versus $35.
This algorithm would need to be trained on good data, so we would first need to improve the data we train on by recommending lenders to our applicants more intelligently.
Still, below is some preliminary analysis with XGBoost.
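The expected-bounty logic can be sketched directly. The [No Loan, A, B, C] class order and the $250/$350/$150 bounty amounts match the weighting code later in this notebook, and the per-lender approval chances are treated independently, as in the worked example above:

```python
import numpy as np

# Bounty per predicted class: [No Loan, Lender A, Lender B, Lender C]
bounties = np.array([0, 250, 350, 150])
labels = ['No Loan', 'A', 'B', 'C']

def best_lender(approval_probs):
    """Return the lender with the highest expected bounty."""
    expected = approval_probs * bounties
    i = int(np.argmax(expected))
    return labels[i], float(expected[i])

# The two worked examples from the text:
print(best_lender(np.array([0.0, 0.0, 0.60, 0.80])))  # ('B', 210.0)
print(best_lender(np.array([0.0, 0.0, 0.10, 0.50])))  # ('C', 75.0)
```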
# Take all the important features (copy the slice so adding a column below
# does not trigger pandas' SettingWithCopyWarning)
X = data[['FICO_score', 'Employment_Status', 'Ever_Bankrupt_or_Foreclose']].copy()
X['Take_Home_Pay'] = data['Monthly_Gross_Income'] - data['Monthly_Housing_Payment']
# Encode categorical features like Employment_Status and Bankruptcy
X_encoded = pd.get_dummies(X, drop_first=True) # One-hot encoding categorical variables
# The target variable is 'Lender', which is a multiclass target
y = np.array([data['Lender'][i] if data['Approved'][i] == 1 else 3 for i in range(len(data))])
y = pd.Categorical(y).codes
# From this we see that 0 -> No Approval, 1 -> A Approves, 2 -> B Approves, 3 -> C Approves
print(np.bincount(y))
# Split the data into training and testing sets
# Hardcoding the train/test split (instead of using train_test_split) keeps
# the row indices aligned with the original dataset for later calculations.
# X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=1508)
X_train = X_encoded[:80000]
y_train = y[:80000]
X_test = X_encoded[80000:]
y_test = y[80000:]
# The data is heavily skewed toward rejection, so weight the samples to keep
# the model from simply predicting rejection for everyone
sample_weights = compute_sample_weight('balanced', y_train)
# Train the XGBoost model for multi-class classification
model = xgb.XGBClassifier(objective='multi:softprob', num_class=4, eval_metric='mlogloss')
model.fit(X_train, y_train, sample_weight=sample_weights)
[89024 6031 1960 2985]
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric='mlogloss',
feature_types=None, gamma=None, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=None, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None, n_estimators=None,
              n_jobs=None, num_class=4, num_parallel_tree=None, ...)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
print(f"Balanced Model Accuracy: {balanced_accuracy_score(y_test, y_pred) * 100:.2f}%")
print('No Loan | A | B | C')
print(np.bincount(y_pred))
print(np.bincount(y_test))
Model Accuracy: 57.27%
Balanced Model Accuracy: 40.08%
No Loan | A | B | C
[11299 2593 3146 2962]
[17646 1323 403 628]
While the accuracy score of this model is low, remember that the labels are also poor. Many people were recommended Lender B when Lender C would have been the better choice, so No Loan is more overrepresented in our dataset than it should be.
This model's hyperparameters are also untuned, which further reduces accuracy.
y_pred = pd.Series(y_pred)
y_test = pd.Series(y_test)
# Find cases where the model's recommendation differs from both the dataset
# label and the lender the applicant actually applied to
y_diff = [[y_pred[i], y_test[i], data['Lender'][i + 80000]] for i in range(len(y_pred))
          if y_pred[i] != y_test[i] and y_pred[i] != ord(data['Lender'][i + 80000]) - ord('A') + 1]
print(y_diff[:5])
[[2, 0, 'A'], [1, 0, 'C'], [3, 0, 'B'], [2, 0, 'A'], [1, 0, 'B']]
This list shows where the model and the dataset disagree. Each entry is [model prediction, dataset label, lender applied to].
The first element states that the model predicts Lender B would be best, while in the dataset the applicant applied to Lender A and was rejected.
The second element states that the model predicts Lender A would be best, while in the dataset the applicant applied to Lender C and was rejected.
If the XGBoost model has correctly found the correlation between the variables above and approval, then all of these are missed opportunities for bounties. Now I will multiply the probability of each outcome by its bounty to simulate how I think the final application should work.
# Gives the models probabilities for each classification
y_pred_proba = model.predict_proba(X_test)
# Bounty per class: [No Loan, Lender A, Lender B, Lender C]
weights = np.array([0, 250, 350, 150])
y_pred_weights = y_pred_proba * weights
max_values = np.max(y_pred_weights, axis=1)
total_sum = np.sum(max_values)
print("Model Predicted Total Bounty for Last 20k data: " + str(total_sum))
print("Dataset Total Bounty for Last 20k data: " + str(np.sum(data['bounty'][80000:])))
Model Predicted Total Bounty for Last 20k data: 1547950.0295901902 Dataset Total Bounty for Last 20k data: 566000
It seems that with our model we are predicted to earn a bounty almost 3x larger than what we currently earn. You might object that the model is predicting approvals we cannot verify, and that is a fair point; we cannot be certain this estimate is accurate. Note also that multiplying each bounty by the model's approval probability handles cases where the model is unsure: if the model predicts a low chance that an applicant should go to Lender B, but every other option also has low probability, the expected bounty stays small.
Obviously, 3x more revenue is a little insane and we should expect much less, but this cursory exploration shows there is plenty of room for improvement in our app.